ABanditNAS: Anti-Bandit for Neural Architecture Search
[Figure: candidate operations (3×3 conv, 5×5 conv, 3×3 max pool, 3×3 depth-wise conv, identity) between blocks $B_i$ and $B_j$. In the "Sampling operations" panel, operations are sampled with the anti-bandit LCB, $s(o_k^{(i,j)}) = m_{k,t}^{(i,j)} - \sqrt{2\log N / n_{k,t}^{(i,j)}}$; in the "Reducing the search space" panel, operations are abandoned with the anti-bandit UCB, $s(o_k^{(i,j)}) = m_{k,t}^{(i,j)} + \sqrt{2\log N / n_{k,t}^{(i,j)}}$.]
FIGURE 4.1 ABanditNAS is divided into two steps: sampling using LCB and abandoning using UCB.
they are confirmed to be bad. Meanwhile, once well trained, weight-free operations are compared only with parameterized operations. In addition, the operation pruning process makes the search space progressively smaller, leading to an efficient search process.
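To make the effect of pruning concrete, here is a back-of-the-envelope sketch (the numbers of edges and operations are hypothetical, not figures from the source): removing one candidate operation per edge in each round shrinks the discrete search space geometrically.

```python
# Hypothetical cell: E edges, each with K candidate operations.
# The discrete search space contains K**E architectures; abandoning one
# operation per edge in each pruning round shrinks it geometrically.
E, K = 14, 8
sizes = [(K - t) ** E for t in range(K)]
print(f"{sizes[0]:.2e} -> {sizes[-1]}")  # ~4.40e+12 -> 1
```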
4.2.1 Anti-Bandit Algorithm
Our goal is to search for network architectures effectively and efficiently. However, NAS faces a dilemma: whether to keep a network structure that already offers significant rewards (exploitation) or to investigate other network structures further (exploration). Based on probability theory, the multi-armed bandit can solve this exploration-versus-exploitation dilemma by making decisions among competing choices so as to maximize the expected gain. Specifically, we propose an anti-bandit that chooses and discards arm $k$ in a trial based on
$$\tilde{r}_k - \tilde{\delta}_k \leq r_k \leq \tilde{r}_k + \tilde{\delta}_k, \qquad (4.1)$$
where $r_k$, $\tilde{r}_k$, and $\tilde{\delta}_k$ are the true reward, the average reward, and the estimated variance obtained from arm $k$. $\tilde{r}_k$ is the value term that favors actions that have historically performed well, and $\tilde{\delta}_k$ is the exploration term that gives actions an exploration bonus. $\tilde{r}_k - \tilde{\delta}_k$ and $\tilde{r}_k + \tilde{\delta}_k$ can be interpreted as the lower and upper bounds of a confidence interval.
The traditional UCB algorithm, which optimistically substitutes $\tilde{r}_k + \tilde{\delta}_k$ for $r_k$, emphasizes exploration but ignores exploitation. Unlike the UCB bandit, we exploit both the LCB and the UCB to balance exploration and exploitation. An arm with a small LCB usually has a low expected reward but a large variance, and should therefore be given a greater chance of being sampled for further trials. Then, based on the observation that the operations that are worst in the early stage usually also perform worst at the end [291], we use the UCB to prune the worst-performing operation and reduce the search space. In summary, we adopt the LCB, $\tilde{r}_k - \tilde{\delta}_k$, to sample the arm that should be optimized further, and use the UCB, $\tilde{r}_k + \tilde{\delta}_k$, to abandon the operation with the minimum value. Because the variance is bounded and converges, the estimated value of an operation stays close to its true value and gradually approaches it as the number of trials increases. Our anti-bandit algorithm overcomes the limitations of a purely exploration-based strategy, including its required level of understanding and its suboptimality gaps. The definitions of the value term and the variance term, together with the proof of our proposed method, are given below.
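As a concrete illustration, the following Python sketch implements this sample-with-LCB, abandon-with-UCB loop for a single edge, using the $\sqrt{2\log N / n_k}$ bonus from Figure 4.1 as the exploration term and a random number as a stand-in for the validation reward. The class and variable names are ours for illustration, not the authors' reference implementation.

```python
import math
import random

class AntiBandit:
    """Minimal anti-bandit sketch for one edge with K candidate operations."""

    def __init__(self, num_ops):
        self.counts = [1] * num_ops   # n_k, initialized to 1 to avoid division by zero
        self.means = [0.0] * num_ops  # value term r~_k: average reward of arm k
        self.total = num_ops          # N: total number of trials so far

    def _bonus(self, k):
        # Exploration term sqrt(2 log N / n_k), as in Figure 4.1.
        return math.sqrt(2.0 * math.log(self.total) / self.counts[k])

    def sample(self, active):
        # LCB sampling: the smallest lower bound marks a poorly estimated,
        # high-variance arm that deserves more trials.
        return min(active, key=lambda k: self.means[k] - self._bonus(k))

    def update(self, k, reward):
        # Incremental form of the value term in Eq. (4.2).
        self.counts[k] += 1
        self.total += 1
        self.means[k] += (reward - self.means[k]) / self.counts[k]

    def abandon(self, active):
        # UCB pruning: drop the arm whose *optimistic* estimate is still
        # the worst; even its upper confidence bound is unpromising.
        active.remove(min(active, key=lambda k: self.means[k] + self._bonus(k)))

bandit = AntiBandit(num_ops=8)
active = list(range(8))
while len(active) > 1:
    for _ in range(20):                           # trials per pruning round
        k = bandit.sample(active)
        bandit.update(k, reward=random.random())  # stand-in for validation accuracy
    bandit.abandon(active)                        # search space shrinks by one op
```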
Definition 1. If the operation on arm $k$ has been recommended $n_k$ times and $\mathrm{reward}_i$ is the reward obtained from arm $k$ in the $i$-th trial, the value term of the anti-bandit is defined as
$$\tilde{r}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{reward}_i. \qquad (4.2)$$
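For instance, with hypothetical rewards, Eq. (4.2) is just the empirical mean of the rewards observed on arm $k$, which the sketch above maintains incrementally:

```python
# Hypothetical rewards for arm k over n_k = 4 trials.
rewards = [0.62, 0.70, 0.68, 0.74]
r_tilde = sum(rewards) / len(rewards)  # r~_k = 0.685
```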